 comprehensive score


CEFW: A Comprehensive Evaluation Framework for Watermark in Large Language Models

arXiv.org Artificial Intelligence

Text watermarking provides an effective solution for identifying synthetic text generated by large language models. However, existing techniques often focus on satisfying specific criteria while ignoring other key aspects, and the field lacks a unified evaluation. To fill this gap, we propose the Comprehensive Evaluation Framework for Watermark (CEFW), a unified framework that evaluates watermarking methods across five key dimensions: ease of detection, fidelity of text quality, minimal embedding cost, robustness to adversarial attacks, and imperceptibility to prevent imitation or forgery. By assessing watermarks according to all these criteria, CEFW offers a thorough evaluation of their practicality and effectiveness. Moreover, we introduce a simple and effective watermarking method called Balanced Watermark (BW), which guarantees robustness and imperceptibility by balancing the way watermark information is added. Extensive experiments show that BW outperforms existing methods in overall performance across all evaluation dimensions. We release our code to the community for future research at https://github.com/DrankXs/BalancedWatermark.


Research on Effectiveness Evaluation and Optimization of Baseball Teaching Method Based on Machine Learning

arXiv.org Artificial Intelligence

In modern physical education, data-driven evaluation methods have gradually attracted attention, especially the quantitative prediction of students' sports performance with machine learning models. This study uses a variety of machine learning models to predict, by regression, students' comprehensive scores in baseball training, in order to evaluate the effectiveness of current baseball teaching methods and to propose targeted suggestions for optimizing training. We build the models and evaluate student performance from collected features such as hitting times, running times and batting. The experimental results show that K-Neighbors Regressor and Gradient Boosting Regressor perform best in overall prediction accuracy and stability, with R² scores and error metrics significantly better than those of the other models. In addition, feature importance analysis shows that cumulative hits and cumulative runs are the key factors affecting students' comprehensive scores. Based on these results, this paper offers suggestions for optimizing training strategies to help students perform better in baseball training. The results show that data-driven teaching evaluation can effectively support physical education and promote the design of personalized and refined teaching plans.
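As a rough illustration of the kind of regression and R² evaluation the abstract describes, the sketch below implements a tiny k-nearest-neighbors regressor and the standard R² formula in plain Python. The feature layout (cumulative hits, cumulative runs), the sample values, and the function names are hypothetical and are not taken from the paper.

```python
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Predict a score as the mean target of the k nearest training rows."""
    # Rank training rows by Euclidean distance to the query point.
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical rows: [cumulative hits, cumulative runs] -> comprehensive score.
train_X = [[0, 0], [1, 1], [10, 10]]
train_y = [1.0, 3.0, 100.0]
prediction = knn_predict(train_X, train_y, [0.5, 0.5], k=2)
```

An R² of 1.0 means the predictions match the targets exactly, while 0.0 means they do no better than always predicting the mean.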


Research on Tibetan Tourism Viewpoints information generation system based on LLM

arXiv.org Artificial Intelligence

Tibet, ensconced within China's territorial expanse, is distinguished by its labyrinthine and heterogeneous topography, its profound historical heritage, and its unique religious ethos. These very attributes, however, have impeded the advancement of Tibet's tourism service infrastructure, rendering existing smart tourism services inadequate for the region's visitors. This study examines how informational disparities at tourist sites affect Tibetan tourism and addresses the challenge of establishing evaluation criteria for Large Language Models (LLMs). It introduces the DualGen Bridge AI system, which employs supervised fine-tuning to bolster model functionality and enhance optimization processes. Furthermore, it proposes a multi-structured framework for assessing generated results. Empirical validation confirms the efficacy of this framework. The study also explores the application of supervised fine-tuning within the proprietary DualGen Bridge AI, aimed at refining the generation of tourist site information. The findings offer insights for optimizing system performance and provide support and inspiration for applying LLM technology in Tibet's tourism services and beyond, potentially revolutionizing the smart tourism industry with advanced, tailored information generation capabilities.


SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese

arXiv.org Artificial Intelligence

We introduce SuperCLUE-Math6 (SC-Math6), a new benchmark dataset to evaluate the mathematical reasoning abilities of Chinese language models. SC-Math6 is designed as an upgraded Chinese version of the GSM8K dataset with enhanced difficulty, diversity, and application scope. It consists of over 2000 mathematical word problems that require multi-step reasoning and come with natural language solutions. We propose an innovative scheme to quantify the reasoning capability of large models based on their performance on problems with different numbers of reasoning steps. Experiments on 13 representative Chinese models demonstrate a clear stratification of reasoning levels, with top models like GPT-4 showing superior performance. SC-Math6 fills the gap in Chinese mathematical reasoning benchmarks and provides a comprehensive testbed to advance the intelligence of Chinese language models.


Comprehensive Score: Towards Efficient Local Search for SAT with Long Clauses

AAAI Conferences

It is widely acknowledged that stochastic local search (SLS) algorithms can efficiently find models of satisfiable formulae for the Boolean Satisfiability (SAT) problem. There has been much interest in studying SLS algorithms on random $k$-SAT instances. Compared to random 3-SAT instances, which have special statistical properties rendering them easy to solve, random $k$-SAT instances with long clauses are similar to structured ones and remain very difficult. This paper is devoted to efficient SLS algorithms for random $k$-SAT instances with long clauses. By combining a novel variable property, $subscore$, with the commonly used property $score$, we design a scoring function named comprehensive score, which is used to develop a new SLS algorithm called CScoreSAT. The experiments show that CScoreSAT outperforms state-of-the-art SLS solvers, including the winners of recent SAT competitions, by one to two orders of magnitude on large random 5-SAT and 7-SAT instances. In addition, CScoreSAT significantly outperforms its competitors on random $k$-SAT instances for each $k=4,5,6,7$ from SAT Challenge 2012, which indicates its robustness.
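The idea of combining $score$ with a second-level property can be sketched as follows. Here $score$ is the usual SLS quantity (clauses made satisfied minus clauses broken by flipping a variable). The abstract does not define $subscore$ or the combination rule, so this sketch assumes that $subscore$ counts clauses going from one to two true literals minus those going from two to one, and that the two are combined as score + subscore / d with a hypothetical parameter `d`. All function names are illustrative.

```python
def true_count(clause, assign):
    """Number of true literals in a clause (DIMACS-style signed ints)."""
    return sum(1 for lit in clause if assign[abs(lit)] == (lit > 0))

def score_and_subscore(clauses, assign, var):
    """Effect of flipping `var` on satisfaction levels of all clauses."""
    flipped = dict(assign)
    flipped[var] = not assign[var]
    score = subscore = 0
    for clause in clauses:
        before = true_count(clause, assign)
        after = true_count(clause, flipped)
        # score: clauses made satisfied minus clauses broken.
        if before == 0 and after > 0:
            score += 1
        elif before > 0 and after == 0:
            score -= 1
        # subscore (assumed definition): 1 -> 2 true literals minus 2 -> 1.
        if before == 1 and after == 2:
            subscore += 1
        elif before == 2 and after == 1:
            subscore -= 1
    return score, subscore

def cscore(clauses, assign, var, d=2):
    """Assumed combination: score plus subscore scaled down by d."""
    s, sub = score_and_subscore(clauses, assign, var)
    return s + sub / d

clauses = [[1, 2], [-1, 3], [1, -3]]
assign = {1: False, 2: False, 3: False}
result = score_and_subscore(clauses, assign, 1)
```

In this toy instance, flipping variable 1 satisfies one clause and breaks another, so $score$ alone is 0. The assumed $subscore$ still separates this flip from others with the same $score$, which is the kind of tie-breaking information that matters more as clauses get longer.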